Diacritization for Real-World Arabic Texts

نویسندگان

  • Emad Mohamed
  • Sandra Kübler
چکیده

For Arabic, diacritizing written text is important for many NLP tasks. In the work presented here, we investigate the quality of a diacritization approach, with a high success rate for treebank data but with a more limited success on realworld data. One of the problems we encountered is the non-standard use of the hamza diacritic, which leads to a decrease in diacritization accuracy. If an automatic hamza restoration module precedes diacritization, the results improve from a word error rate of 9.20% to 7.38% in treebank data, and from 7.96% to 5.93% on selected real-world texts. This shows clearly that hamza restoration is a necessary step for improving diacritization quality for Arabic real-world texts.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Tashkeela: Novel corpus of Arabic vocalized texts, data for auto-diacritization systems

Arabic diacritics are often missed in Arabic scripts. This feature is a handicap for new learner to read َArabic, text to speech conversion systems, reading and semantic analysis of Arabic texts. The automatic diacritization systems are the best solution to handle this issue. But such automation needs resources as diactritized texts to train and evaluate such systems. In this paper, we describe ...

متن کامل

Smoothing methods for a morpho-statistical approach of automatic diacritization Arabic texts (Méthodes de lissage d'une approche morpho-statistique pour la voyellation automatique des textes arabes) [in French]

We present in this work a new approach for the Automatic diacritization for Arabic texts using three stages. During the first phase, we integrated a lexical database containing the most frequent words of Arabic with morphological analysis by Alkhalil Morpho Sys which provided possible diacritization for each word. The objective of the second module is to eliminate the ambiguity using a statisti...

متن کامل

SHAKKIL: An Automatic Diacritization System for Modern Standard Arabic Texts

This paper sheds light on a system that would be able to diacritize Arabic texts automatically (SHAKKIL). In this system, the diacritization problem will be handled through two levels; morphological and syntactic processing levels. The adopted morphological disambiguation algorithm depends on four layers; Uni-morphological form layer, rule-based morphological disambiguation layer, statistical-b...

متن کامل

Diacritization: A Challenge to Arabic Treebank Annotation and Parsing

Arabic diacritization (referred to sometimes as vocalization or vowelling), defined as the full or partial representation of short vowels, shadda (consonantal length or germination), tanween (nunation or definiteness), and hamza (the glottal stop and its support letters), is still largely understudied in the current NLP literature. In this paper, the lack of diacritics in standard Arabic texts ...

متن کامل

Arabic Diacritization in the Context of Statistical Machine Translation

Diacritics in Arabic are optional orthographic symbols typically representing short vowels. Most Arabic text is underspecified for diacritics. However, we do observe partial diacritization depending on genre and domain. In this paper, we investigate the impact of Arabic diacritization on statistical machine translation (SMT). We define several diacritization schemes ranging from full to partial...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2009